We were highly motivated by the extraordinary possibility of working with data on the COVID pandemic. We found it especially interesting because it was a very present issue that affected us all.
Corona-related data sets
Corona Cases:
Geographic distribution of COVID-19 cases worldwide, by ECDC (European Centre for Disease Prevention and Control) (https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide: https://opendata.ecdc.europa.eu/covid19/casedistribution/csv)
Corona testing:
OWID dataset regarding testing in different countries. (https://ourworldindata.org/coronavirus-testing)
Economic data sets
Flight monitoring:
"Flight tracking statistics", by Flightradar24. Time series of 'total' number of flights and 'commercial' flights tracked by the system.(https://www.flightradar24.com/data/statistics).
Oil prices:
Europe Brent and WTI (Western Texas Intermediate) Spot Prices. (https://datahub.io/core/oil-prices, https://datahub.io/core/oil-prices/r/brent-daily.csv, https://datahub.io/core/oil-prices/r/wti-daily.csv)
Natural Gas prices:
Natural Gas Prices, including US Henry Hub. Data from U.S. Energy Information Administration EIA (https://datahub.io/core/natural-gas, https://datahub.io/core/natural-gas/r/daily.csv)
Stock index and 10y governmental bonds history:
Five years of closing values for the stock indexes and bonds of a select group of countries, retrieved by copying data fetched by an online widget (https://tradingeconomics.com). Also, 4.5 months of history on a selection of indexes made available by another student group on Slack on April 25th 2020 (Nykredit).
Currency exchange rates:
Euro foreign exchange reference values from the European Central Bank (https://www.ecb.europa.eu/stats/policy_and_exchange_rates/euro_reference_exchange_rates/html/index.en.html: https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip?25f87d6c2bd28fe22263ec93473b6a9c)
General/other data sets
Country data:
Collection of merged data sets with details such as country official codes and names (ISO 3166), currencies (ISO 4217), phone dialing codes (ITU). (https://datahub.io/core/country-codes: https://datahub.io/core/country-codes/r/country-codes.csv)
GIS vector maps:
Country vector maps originally by Natural Earth (https://www.naturalearthdata.com/), but slightly complemented with extra data. (https://geojson-maps.ash.ms / https://datahub.io/core/geo-countries: https://datahub.io/core/geo-countries/r/countries.geojson)
Human Development Index:
Human Development Index originally published by the UN (http://hdr.undp.org/en/indicators/137506#), but slightly improved by OWID. Includes index values for the years 1980-2017, together with country codes and names. (https://ourworldindata.org/human-development-index)
The number of corona-related datasets can be overwhelming. The one from ECDC was chosen because it seemed to be the most neutral one. The famous dataset from Johns Hopkins has almost the same data, and even complementary information such as specific geographic coordinates of cases, but its level of detail seemed biased towards the USA. Likewise, the DOLT dataset was heavily biased towards Asia.
The ECDC corona data set provided a fine basis for examining the spread of the virus, measured by both the number of registered cases and the number of deaths. Furthermore, the set was coded with proper country codes.
The general country dataset was used mainly for two reasons: firstly, to translate country codes into country names, and secondly, to translate between different country coding styles. The chosen dataset was highly relevant in the early stage, since at that point currency exchange rates were thought to be a central theme, and this data set also made it easy to relate currency codes to country codes.
Vector maps were necessary for map making. However, there are a lot of such datasets of varying quality. While many of them are based on the very same vector data, the data accompanying the vector features was often of questionable quality. It was therefore preferable to use only the provided vector data and to rely on other, more original sources for everything else.
Although stories in the media referenced data on subjects such as quickly rising unemployment, this data was not easily accessible. In fact, it did not seem possible to gather such data for an up-to-date analysis of the global situation. This is probably because datasets about the economic state of each country tend to be published as part of a periodic retrospective analysis. Therefore, other economic data, describing the day-to-day development in parallel to the spread of corona, had to be found.
Initially, currency exchange rates were chosen as a highly relevant data source. However, it turned out that such exchange rate datasets are of course relative to a chosen base currency - typically USD. They would therefore describe the global development relative to the development of the country issuing the base currency, which could be problematic. Furthermore, currency exchange rates turned out to be a bad measure for comparing the European countries, due to the usage of the Euro across many of these countries. Even the Danish krone is effectively pegged to the Euro. In the end, currency rates were abandoned in favor of government bonds.
However, government bonds also turned out to be an issue, since not all countries have short-term government bonds. Some of the selected focus countries only had 10-year bonds. Such a timeframe would definitely dampen the impact of the very present situation with the corona pandemic. Bonds were therefore also excluded from the final production.
In the end, national stock indexes were chosen as an indicator of the present development in the countries. Using such indexes is not ideal, since they only include a narrow selection of the most popular stocks. Also, the companies included are often actors on the global market, so the indexes have an unintended international bias. Furthermore, they do not describe smaller businesses or privately owned companies. However, the indexes describe the investors' willingness to invest in the most valued businesses of each country. In the end, for a smaller analysis like this, such bias is considered quite acceptable when trying to compare how the corona pandemic affects the individual countries.
The flight tracking, oil and natural gas datasets were chosen to complement the stock indexes, as other up-to-date sources describing the global effects of the pandemic. These were chosen simply because of how they showed an unprecedented, abrupt, global change.
The Human Development Index is a statistical index that aggregates information about life expectancy, years of schooling and gross national income (per capita) of the countries and expresses it as a value between 0 and 1. According to this metric, as of 2018 the most developed country in the world is Norway and the least developed one is Niger. We wanted to investigate how such a socio-economic index correlates with the COVID outbreak.
Finally, none of the COVID cases would be recorded if not for the tests performed on patients. The tests allow the governments to track the outbreak and plan their next countermeasures. We considered it quite relevant to take a closer look at how many tests are done by each country, and whether more testing can lead to fewer deaths.
We chose to do a magazine-article type of production and also intended to produce some actual analyses. However, in the end we did not have enough time to produce these as well, and the production is therefore somewhat lacking in that respect. Nonetheless, we believe that the visualizations would be a great support for the intended text of an article.
The goal is for the reader to get a comprehensive overview of the global COVID pandemic situation. By investigating the crisis from different angles, we are able to understand current affairs better and in more depth. The user should be able to interact with the visualizations provided, explore the topic in detail and perhaps draw some new conclusions.
The actual data import, cleaning and preprocessing is done via the homemade Python modules (own.data.*). Generally, such modules load csv or json data files into Pandas dataframes or series and store them in variables in the modules. The json files can contain nested lists and maps, which require more manual processing than two-dimensional csv files. This is also done in the homemade modules.
Where relevant, data types for categorical columns are set already on import. Date columns are cast into datetime, and when dates are separated into day, month and year, these are merged into one column. Also, values such as NaN and booleans are converted into the corresponding Python data structures. Finally, column names are typically mapped to fixed internal names.
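The import steps described above can be sketched roughly as follows. This is only a minimal illustration on toy rows, not the actual own.data.* code; the column names (day, month, year, country_code, deaths) are assumptions chosen for the example.

```python
import pandas as pd

# Toy rows; the column names are hypothetical stand-ins for a raw file.
raw = pd.DataFrame({
    "day": [1, 2, 1],
    "month": [4, 4, 4],
    "year": [2020, 2020, 2020],
    "country_code": ["DNK", "DNK", "SWE"],
    "deaths": [3, 5, 8],
})

# pd.to_datetime can assemble one datetime column from year/month/day columns.
raw["date"] = pd.to_datetime(raw[["year", "month", "day"]])
raw = raw.drop(columns=["day", "month", "year"])

# Set the categorical data type already at import time.
raw["country_code"] = raw["country_code"].astype("category")

print(raw.dtypes)
```

Merging the date parts early means that all later steps can rely on a single datetime column.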
For easy data retrieval, the modules also contain convenient methods that provide easy access to the data in various forms. This could be in cumulated form or merged with other data sets.
Generally, three-letter country codes are used across all data sets whenever possible. The codes used were expected to simply follow the ISO 3166-1 alpha-3 standard, but anomalies appeared, since some countries are not recognized by all organizations. Kosovo, for instance, is recognized by most EU member states but is not a UN member, and it has no official ISO 3166-1 code (KOR, which might look like a natural choice, belongs to South Korea; the user-assigned code XKX is commonly used instead). Also, there are issues with sovereignty, polluting some country coding.
In the following we present the structure and content of the different datasets.
The actual data import, cleaning and preprocessing is done via the homemade own.data.ecdc module as previously described.
The ECDC data set consists of one csv file, which is imported into a pandas dataframe. The raw structure is as follows:
A few countries, such as Anguilla and Namibia, have missing country codes. For now, these are simply skipped, but they could be preserved by manual data cleaning.
The date and country columns are then set as a two-level multi-index.
The population size of each country is extracted and preserved in a separate variable.
Finally, only the relevant columns are preserved.
The own.data.ecdc module also contains a range of helper methods that provide easy access to various combinations of the data. Some of these methods return the data unstacked (via unstack()), and some add cumulative values.
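As a rough sketch of what such helpers might do, the example below sets a (date, country) multi-index, unstacks it into one column per country and adds cumulative values. The data and the names records, daily_deaths and total_deaths are made up for illustration and are not taken from own.data.ecdc.

```python
import pandas as pd

# Toy long-format data with a two-level (date, country_code) multi-index.
records = pd.DataFrame({
    "date": pd.to_datetime(["2020-04-01", "2020-04-02", "2020-04-01", "2020-04-02"]),
    "country_code": ["DNK", "DNK", "SWE", "SWE"],
    "deaths": [1, 2, 10, 20],
}).set_index(["date", "country_code"])

# Unstack the country level: one row per date, one column per country.
daily_deaths = records["deaths"].unstack("country_code")

# Cumulative values; days without a report are treated as zero deaths.
total_deaths = daily_deaths.fillna(0).cumsum()

print(total_deaths)
```

The unstacked form is convenient for line plots, since each column becomes one line.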
The number of daily deaths per country varies a lot.
However, so does the size of the populations.
Geospatial data such as country borders are imported via geojson files, which contain features (rows) with both a geospatial column and possibly also other data. These are imported with GeoPandas, a geospatial extension of Pandas, which makes the data mergeable with other data frames. The geojson files used also contain some relevant ordinal columns, and these are properly ordered. Also, the files include some convenience columns for map coloring, but these are not used. Most of the columns from the geojson files were not used, but left as is. In the end, the geospatial columns were of prime relevance.
The actual import and processing is done via the homemade own.data.countries module.
The country data set aggregates data about the countries of the world from governmental agencies, the UN and some independent projects. It helps us by providing a mapping between country names and country codes. As the common names tend to vary between sources, one parameter which usually remains the same is the country code (either two-letter or three-letter). Furthermore, it includes information about continents, regions and sub-regions, which allows countries to be grouped together, as well as a development classification (established by the United Nations) and the country independence status (established by the CIA).
Country data was placed in the module own.data.countries
As we mentioned in the description above, the dataset is very comprehensive. Thus we will not always need all the columns. Below we show which information can be extracted from the main dataframe with the module.
The dataset also contains details about the countries' development status, which are made available separately.
The oil data we found comes from two sources and shows prices in USD per barrel. The first one (BRENT) refers to European oil prices in the years 1987-2020, while the second dataset (WTI CRUDE) refers to American ones in the years 1986-2020.
Because we wanted to keep our analysis global, and also because the difference between those regions is not significant, we calculated the average oil price and use it in the further analysis. The oil prices dataset is stored in the module: own.data.oil
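A minimal sketch of how such an average could be computed is shown below. The prices are toy values, and the names brent, wti and oil are illustrative only. Note the assumption that on days where only one of the two series has a value, the average simply equals that value, because mean() skips missing values by default.

```python
import pandas as pd

dates = pd.to_datetime(["2020-04-01", "2020-04-02", "2020-04-03"])

# Toy prices in USD per barrel; WTI lacks the last day on purpose.
brent = pd.Series([20.0, 22.0, 24.0], index=dates, name="brent")
wti = pd.Series([18.0, 20.0], index=dates[:2], name="wti")

# Align both series on the date index and average them row by row.
oil = pd.concat([brent, wti], axis=1)
oil["average"] = oil[["brent", "wti"]].mean(axis=1)

print(oil)
```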
The gas data is provided by the U.S. Energy Information Administration (EIA) and gives an overview of gas prices (in USD) in the United States in the years 1997-2020. The data is stored in the module: own.data.gas
This dataset is combined from two subsets. The first one describes the total number of flights, which basically includes all flying objects that Flightradar24 was able to track and store in its database. The second is narrowed down to commercial passenger, cargo, charter and some business flights. The actual raw data consists of more than one file, as we gathered the data over a few days (Flightradar24 only provides statistics for the last 90 days). Therefore, the presented data is actually a couple of csv files accumulated into two raw datasets. The two datasets are rather similar and contain the total number of flights as well as a 7-day average. The flights data is stored in the module: own.data.flights
The only processing done at this stage is to combine the dataframes into one, which holds all the gathered data. As one can observe below, some commercial flights data is missing. This is due to the inclusion of an older data file that covers total flights only.
Stock data is loaded from multiple json files - one per country. While the whole original time series are kept, only the closing values are processed and used. These are processed into one collective pandas dataframe.
The actual import and processing is done via the homemade own.data.stocks2 module.
The testing data from OWID contains details not only about testing, but also about cases and deaths, as well as their normalized values. The timespan of the dataset stretches from the end of December 2019 to the end of April 2020. The module responsible for loading this data: own.data.owid
We filter out information which is not test related, as we take the data on COVID from the ECDC dataset. After further inspection, we notice that 128 countries lack information about tests. Furthermore, the units of the tests vary a lot between countries, as OWID classifies them into 10 different categories.
The HDI dataset provides values of HDI for 190 countries in the years 1980-2017, together with country codes and names. The dataset is loaded and processed by the module: own.data.hdi
In order to correlate the HDI data with other datasets, we have merged it with the country data, the tests data and the ECDC data. This allows us to easily correlate various data or group countries by regions, continents or economic factors. We also calculate tests per capita.
During initial data analysis, another group's stock data was used as inspiration. This set was made available via Slack. The excel file required different processing, since the data was not provided with dates per row, but rather per column. The actual import and processing is done via the homemade own.data.stocks module.
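Turning a sheet with one date per column into the usual one-date-per-row layout can be sketched as below. The frame stands in for the loaded excel sheet; the index names IDX_A and IDX_B, the values and the column name index_name are invented for the example and do not reflect the actual file or the own.data.stocks code.

```python
import pandas as pd

# Toy stand-in for the sheet: one row per index, one column per date.
wide = pd.DataFrame({
    "index_name": ["IDX_A", "IDX_B"],
    "2020-04-14": [100.0, 2000.0],
    "2020-04-15": [101.0, 1990.0],
})

# Transpose so that dates become rows and stock indexes become columns.
by_date = wide.set_index("index_name").T
by_date.index = pd.to_datetime(by_date.index)
by_date.index.name = "date"

print(by_date)
```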
In the end, since data was limited to April 15th 2020, it was only used for inspiration and picking out relevant stock indexes per country.
One thing is the number of daily deaths, another is how these cumulate over time. During the previous data exploration, it was obvious that USA generally had more daily deaths and therefore would have also more total deaths. The other countries were a blurrier story. It turns out that a group of larger western countries dominates the charts.
However, as the population size varies, so would the number of deaths. When calculating the total number of deaths per capita, there is quite a different distribution. Two very small states, San Marino and Andorra, now appear in the top three, and the order of the other countries has shifted a great deal. Most noticeably, the USA is now by no means the leading country. Surprisingly, Belgium is taking the lead (if the two micro-states are ignored).
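The effect of normalizing by population can be illustrated with a small sketch. The three countries AAA, BBB and CCC and all numbers are purely made-up toy values, chosen so that the ranking flips, and the choice of "per 100,000 inhabitants" as the scale is an assumption.

```python
import pandas as pd

# Toy values only: a large, a medium and a very small country.
total_deaths = pd.Series({"AAA": 1000, "BBB": 100, "CCC": 10})
population = pd.Series({"AAA": 100_000_000, "BBB": 1_000_000, "CCC": 50_000})

# Deaths per 100,000 inhabitants, sorted with the worst affected first.
deaths_per_100k = (total_deaths / population * 100_000).sort_values(ascending=False)

print(total_deaths.sort_values(ascending=False).index.tolist())
print(deaths_per_100k.index.tolist())
```

In absolute numbers the large country leads, while per capita the smallest one does, which is the same kind of shift seen for the micro-states above.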
These bar plots do not, however, provide proper insight into the development over time. Plotting the time series of the cumulated number of deaths for each of the 30 worst affected countries illustrates how differently the pandemic has spread across these countries. But yet again the per-capita version shows the other side of the truth.
Limiting the number of countries to a selection of 10 focus countries further highlights this difference.
For the record: the selection includes a mix of mainly European countries. When measured by total number of deaths they could roughly be split into three groups, but when measured per capita, they spread out nicely and in a different order. USA and Belgium were specifically selected as the most affected countries, and China for being the patient-zero country. Denmark and Sweden were selected due to their overall similarity, yet noticeable differences in countermeasures against corona.
These datasets were already plotted in the previous section, but with a much longer time series than necessary. When focusing on only the recent months, it is clear that the oil prices were affected, but that the gas prices have stayed at a steady level during the corona crisis.
These datasets were already plotted in the previous section, and the plot clearly showed how the number of registered flights fell as the epidemic turned into a pandemic. There should be no further analyses necessary.
As we have seen in the previous chapter, it can be hard to compare absolute values across countries. The same applies to stocks. The situation is even more complicated there, as the index values are determined at the country level, by the national stock exchange. Therefore, we decided to normalize each index by the value it had at the beginning of 2020.
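Such a normalization can be sketched as follows, under the assumption that "the beginning of 2020" means the first trading day of 2020 present in the data. The index names IDX_A and IDX_B and the closing values are toy data.

```python
import pandas as pd

# Toy closing values for two hypothetical indexes on very different scales.
closing = pd.DataFrame(
    {"IDX_A": [95.0, 100.0, 80.0], "IDX_B": [2100.0, 2000.0, 1900.0]},
    index=pd.to_datetime(["2019-12-30", "2020-01-02", "2020-03-16"]),
)

# Baseline: the first available row in 2020 (assumed reference point).
baseline = closing.loc[closing.index.year == 2020].iloc[0]

# Dividing by the baseline puts every index at 1.0 on the reference day.
normalized = closing / baseline

print(normalized)
```

After this step a value of 0.8 reads directly as "20 % below the level at the start of the year", regardless of the original scale of the index.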
By plotting the normalized values, we are able to understand how stock exchanges were impacted by COVID and compare which countries were affected the most. It is clear that China's index was impacted first, as the outbreak there began earlier than in Western countries. China experienced a second decline around the time the Western countries experienced theirs. However, its decline is the smallest of all the countries shown in the graph. We can observe how stocks start to drop around the end of February, when the first deaths in Italy and Spain were recorded. The decrease in value continues for around two weeks, until mid-March, when many European countries started to enforce lockdowns. What makes this figure interesting is the notable increase in value at the end of March, when the restrictions were becoming more and more severe. This tendency continues, with varied gain, until the beginning of May.
If we look at the latest days shown in the plot, we can easily name the countries whose indexes were least and most impacted by the outbreak. The three most impacted ones are the Spanish, Italian and French indexes, while the three least impacted are the Danish, Chinese and American ones.
In order to combine data about cases, deaths and tests with HDI, we had to choose to exclude some countries, in particular those which did not have any data on tests. Furthermore, we were interested in the most recent data, but the actual time span of the dataset is between the 13th and 23rd of April. This is due to the fact that, for each of the countries, we recorded only the latest date on which all three values (cases, deaths and tests) were available. In summary, in this analysis we consider 74 countries all over the world.
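The selection rule described here, the latest date per country on which cases, deaths and tests are all available, could be implemented along these lines. The frame obs, its column names and its values are hypothetical toy data, not the merged dataset itself.

```python
import pandas as pd

nan = float("nan")

# Toy observations; CCC never reports any tests.
obs = pd.DataFrame({
    "country_code": ["AAA", "AAA", "AAA", "BBB", "BBB", "CCC"],
    "date": pd.to_datetime([
        "2020-04-21", "2020-04-22", "2020-04-23",
        "2020-04-13", "2020-04-14", "2020-04-23",
    ]),
    "cases": [10, 12, 15, 5, 6, 7],
    "deaths": [1, 1, 2, 0, 1, 1],
    "tests": [100.0, 120.0, nan, 50.0, nan, nan],
})

# Keep only rows where all three values are present.
complete = obs.dropna(subset=["cases", "deaths", "tests"])

# For each country, keep the most recent complete row.
latest = (
    complete.sort_values("date")
    .groupby("country_code")
    .tail(1)
    .set_index("country_code")
)

print(latest)
```

Countries without any test data drop out automatically, and the remaining rows can end up on different dates per country, which matches the 13th to 23rd of April span mentioned above.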
As we have seen previously with cases and deaths, total numbers might not be a fair metric when comparing country-related data. Therefore, in the further analysis we shall only use values per capita.
Let's take a look at tests alone. Countries which perform a lot of tests per capita are either small, located in Europe or both. As one might expect, underdeveloped and poor countries perform significantly fewer tests. This leads us to the conclusion that it is hard to estimate the impact of COVID in those parts of the world.
Finally, if we look at the focus countries, we observe that Italy performs the most, while the UK the least tests per capita.
As a final remark, it is important to understand why some countries perform fewer tests. The Human Development Index might give us an answer. The more developed the country, the more tests it can perform, or rather, can afford. According to this source, the price of a test in the USA ranges between 35 and 50 USD. So, as long as those tests do not become affordable or accessible for the developing countries, we will not be able to understand how badly the COVID crisis is impacting them.
A complex topic such as the COVID-19 outbreak could be explained in all possible genres of visualization. We decided to choose the old-fashioned magazine style. It allows us to include meaningful visualizations, while leaving some room for explanation written in text.
Either way, we believe that the production serves the purpose of providing the reader with relevant information about the spread of the corona pandemic, how it impacts a small handful of economic aspects on both the global and national level, and even how differently the situation is experienced in developing and developed countries. This last aspect is especially meaningful for the test data, as testing allows countries to monitor and track the health crisis.
Visual Structuring
Highlighting
Transition Guidance
Ordering
Interactivity
Hover Highlighting / Details
Filtering / Selection / Search
Navigation
Buttons
Stimulating Default Views
Messaging
Captions / Headlines
Introductory Text
Summary / Synthesis
Rather than just referring to the final visualizations, we have chosen to also generate and display the very thing. The plots are generated by a handful of notebooks, which are prepared for saving plots to html files, showing the plots in the notebook, or saving them in a special variable for usage elsewhere. Here the latter is of course the case. A special Python module (own.funny), which is also used by all these notebooks, changes the behavior of the notebooks.
We have chosen to start out with a choropleth map that shows the current spread of the corona virus, with a colormap that really highlights the differences between the countries. The countries are affected to very different extents, and by using a traffic-light colormap that mirrors what the user would normally see on the Google Maps traffic overlay, we seek to help the reader quickly understand the situation.
A choropleth is often a good way to visualize geographical differences when the area can be split into non-overlapping polygons. By starting out with such a visualization, we naturally also seek to catch the reader's interest.
The choropleth plugin provides the basic mapping functionalities, such as a background image and zooming and panning capabilities, in a way that the reader is accustomed to. On top of this we add a layer of polygons that match the countries, and color these according to the colormap. Also, we add hover tooltip boxes that contain various details about the country the cursor is currently above. Thereby we provide a way for the reader to also zoom into the data and not only zoom in on the map.
However, a static choropleth does not give any insights into the temporal dimension. We therefore complement the former choropleth with an animated version on which the reader can control time with a slider. We would have preferred adding the possibility of passing through time automatically, but in the end, we had to choose between different tools that each had their limitations, and we chose to lose automatic animation in favor of a tool that provided a more appealing visualization with less effort.
For the timed choropleth we naturally use the very same colormap as for the previous choropleth. However, we complement the gradual change in color with a gradual increase in alpha value. Thereby we start out with a naked map, and the colored country polygons individually fade in as people begin to die in the corresponding country.
The plot only offers little functional interactivity: A hover tooltip shows the population size with higher precision than the value located on top of each bar.
This particular plot has proven to be a bit troublesome when added like this, so the generated html file is also shown as a backup...
We could easily have kept on using a choropleth for also showing country population size, and by doing so, we could have presented the size of all countries. However, we have chosen a set of focus countries, and a bar plot is better suited for showing the actual size difference. With a choropleth the reader would have to translate a color into a quantity, but a bar plot visualizes the sizes in a way that makes it much easier to determine the differences and to compare them.
Now, the development of the virus in the ten focus countries is visualized on two line plots. First, the total number of deaths per country is shown on one plot, and after that, another diagram shows deaths normalized per capita. By displaying total deaths and deaths per capita on two different plots close to each other, it is possible to observe the obvious difference in development.
By using distinct colors per country, the reader is assisted in locating the same country on the other plot - and on later plots.
Compared to the choropleths, the first plot with total deaths shows the exact development per day per country on the y-axis, and not just a color which would only give an estimated value. Thereby it is possible to identify even small differences between two countries, which would be hard or even impossible on a choropleth.
Showing only a handful of focus countries instead of the whole lot makes it possible to do a line plot like this in a meaningful way.
The legend allows for 'muting' the individual country. This would have allowed for inclusion of further countries and still have a meaningful uncluttered plot. However, the legend would grow linearly alongside the number of lines, and also the usage of distinct coloring per country would be lost.
Locating the y-axis on the right side, where the number of deaths grows, makes reading the latest values easier than if the axis were located on the left.
The plot contains more data than initially shown, since it has been zoomed into the most interesting period. It is therefore possible to pan to the past and zoom further in on the earlier stages.
And finally, a hover tooltip box of course allows for exact reading of dates and deaths.
Altogether, the two plots successfully demonstrate the differences between the development in total deaths and deaths per capita.
By combining the global death count with the number of registered flights, it is possible to give the reader a feeling of just how closely the abrupt decline in airline traffic corresponds to the sudden increase in the global death count.
While the 7-day average is probably the most relevant, also including the daily counts as thinner, semi-transparent lines helps to legitimize the primary 7-day average lines. And by pre-muting the daily lines, they are reduced to a noticeable shadow that does not call for attention, but still serves its purpose.
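For a daily series that does not already come with a 7-day average, one could be derived roughly as sketched below. The use of a trailing (rather than centered) window is an assumption, and the series is toy data.

```python
import pandas as pd

# Toy daily counts for eight consecutive days.
daily = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    index=pd.date_range("2020-03-01", periods=8, freq="D"),
)

# Trailing 7-day mean; the first six days have no full window and stay NaN.
avg_7d = daily.rolling(window=7).mean()

print(avg_7d)
```

The averaged line smooths out weekday effects in reporting, while the raw daily line remains available as the thin background line.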
Since the y-values of the two datasets are quite different, a secondary axis is added to the right.
Altogether, the plot gives a good impression of how the decrease in airline traffic caused a sudden rise in corona-related deaths. Just kidding, naturally...
The plot with oil and natural gas prices basically follows the same pattern as the previous plot.
Contrary to the flight traffic plot, this plot contains a long history of oil and gas prices. The plot is pre-zoomed into the most recent prices, since these are the relevant ones, but it is possible to pan back in time and zoom, should the reader be interested.
The plot shows the values of the most important stock index for each of the focus countries. We used a line plot, which is a common choice for visualizing time series. By also showing the deaths, the plot indicates that the absolute index values of the biggest companies show no obvious reaction to the increasing number of deaths worldwide. However, it might be hard to investigate the situation in different countries, because absolute values are not comparable to each other, as the indexes are listed on national stock exchanges.
We continue the time-series visualizations with a line plot, but this time we normalize by the value on a fixed date, the 2nd of January 2020. This date was chosen as a reference point because it was the very beginning of the COVID outbreak. It was also a time when the markets were not reacting to the crisis just yet. Here the lines become a little occluded, so the user is able to select the countries they are interested in by clicking on the legend.
We chose a scatter plot for this visualization, as it provides a good overview of the 74 countries which published their data on COVID testing. Both axes are customizable: the user can set different categories (cases, deaths and tests per capita) on the Y axis, while for the X axis the user is able to choose between human development (HDI) and tests per capita. This allows the reader to understand not only how cases and deaths depend on development level, but also how they depend on the tests performed by each country. With this visualization, we can e.g. see if monitoring (tests) can lead to a decrease in deaths. With the hover tool the reader is able to learn the important details about a given country. Finally, if the plot becomes a little occluded, the user is able to choose the groups of countries belonging to a few of the 7 development categories and hide the others.
Choosing corona-related datasets as the basis for our final project proved to be a risky choice. The outbreak is still developing, and therefore the situation can change within a day or a week. We have tried to find the best datasets, which would allow us to explore the impact of the epidemic in various areas of the economy, but also to understand the spread of the disease better by looking at the tests and HDI datasets. Searching for good data sets was quite time consuming, and in the end many of the datasets identified were not used in the final website.
There are a couple of aspects we are glad we achieved. With our article, the reader is able to get an impression of the extent of the COVID crisis. In all sections - from the spread of the virus measured by the number of deaths, to the financial impact of the pandemic - the readers can compare the situation in the different countries. And finally, the reader can explore the data themselves through a small interactive dashboard describing how testing can be related to the Human Development Index.
However, our website is by no means perfect. We initially chose to do a magazine-style article with a good handful of visualizations, but in the end, we did not have time to write anything near the amount of article text we had planned. We therefore ended up with a production that is kind of a mix between an article and annotated charts.
The intention was an article where the illustrations supported the text, and not the other way around. Either way, we believe that the production serves the purpose of providing the reader with relevant information about the spread of the corona pandemic and how the pandemic impacts a small handful of economic aspects on both the global and national level.
As for the visualizations, we have chosen to limit the number of countries to a handful of focus countries. We could have expanded the visualizations to include a few more countries or made the plots customizable with a complete list of all countries, so that the reader decides exactly which countries shall be investigated. However, since our intention was a magazine article with a story to tell and not a dashboard, we chose to limit the visualizations to what was actually needed for the story.
The spread of the pandemic can of course also be related to the population density of a given country, or its demographic composition. Such metrics were also considered, but since we had chosen an economic angle, they were considered out of our scope. Another interesting topic to explore would have been how restrictions applied by the countries were set according to the growth in cases and deaths. Furthermore, in our analysis we completely omitted the data on recoveries.
From the economic point of view, there are a couple of indicators of how badly the national economies were impacted by the outbreak. The stocks we have found might not be a good measure of a country's economic situation, as the companies tend to operate worldwide, and their value might not rely solely on the national market. Data on the unemployment rate, retail sales or the increase of national debt could tell us more about how countries lessen the influence of the outbreak on the economy or how deeply they go into recession. Unfortunately, the only sources we found on those topics were either paid or not up to date.
A general issue with a lot of the datasets was that subjects such as testing and countermeasures do not provide data that is easily comparable. Furthermore, the quality of the health care system in many countries is at best questionable, which could explain why some poorer countries have registered fewer deaths. However, the uneven spread might in fact be the reality, and would then instead be explained by, for instance, the travel patterns of the richest countries. In the end, based on our datasets, we are of course not able to draw any such conclusion.
Overall, the project could be individualized as follows:
Heino
Aleksander
Preprocessing has to a large extent been structured collectively.